1. INTRODUCTION

Font digit recognition is a development in the automatic recognition of handwritten characters and
numerals by a machine or personal computer, and it is part of character recognition technology. It can be
applied to the automatic identification of computer character information, for example in medical
documentation [1], bank check processing and postal mail sorting [2]. Because many digitized characters
exist, it is difficult to distinguish large numbers of computer-written characters effectively, given the
thousands of numeral font types. With the rapid growth of worldwide data and the increasing demand for
technology, the need for digitized character recognition is urgent. Various techniques have been explored
for font digit recognition, such as the Nearest Neighbor algorithm [2] and Neural Networks [3]. However,
the ability of these methods to model complex classification problems and their generalization ability are
limited, so high recognition accuracy cannot be achieved. The advancement of deep learning technology
and the rise of Convolutional Neural Networks (CNNs) make it possible to address this issue.

CNN is a part of deep learning that is mostly used to classify images, cluster them by similarity, and
perform image recognition within scenes. CNN has shown remarkable improvement in various object
recognition tasks such as iris recognition [4], traffic sign recognition [5-6], face recognition [7-9], fruit
recognition [10], leaf recognition [11] and font recognition [12]. The common structure of a CNN consists of
layers of neurons. A neuron takes input values, performs computations and passes the result to the next
layer. Each model has its own convolution layers and computational complexity. Modifications can be made
to the CNN architecture to increase its performance, for example by applying bi-linear CNN [13] and
multi-maxpooling layers [14].

Apart from CNN, another popular technique for object recognition is Bag of Features (BoF). BoF is
a machine learning technique that represents an image as an orderless collection of local features [15]. The
term BoF comes from the term Bag of Words (BoW) used in textual information retrieval. BoF with
Speeded-Up Robust Features (SURF) and a Support Vector Machine (SVM) classifier has been applied to
food image recognition [15], face recognition [16], text classification [17], and leaf recognition [18] with
good results.

A comparative study between CNN and BoF for leaf recognition indicates that BoF is better than
CNN [19], whereas CNN is better than BoF for fruit recognition [20]. Because of these inconsistencies in the
accuracy of CNN and BoF, this paper conducts a comparative study between CNN and BoF for multiple
font digit recognition. This paper is organized as follows. The next section explains the overall architecture
of CNN and BoF, followed by a discussion of the results and an analysis of the accuracy performance. The
last section concludes this paper and states the future work.


2. OVERALL ARCHITECTURE 
2.1. Convolutional Neural Network (CNN) 

The dataset follows the constraints imposed by the CNN models. One of these constraints concerns
the size of each individual image. In this work, CNN and BoF use a fixed image size for training and
testing. Besides the constraints from the models, the hardware used plays an important part in the
performance. To test the CNN and BoF models, a high-performance laptop with 8 gigabytes of
Random-Access Memory (RAM) is used.

The dataset used for this experiment is the Chars74K dataset [21]. It comprises images of
handwritten characters and digits. Each character is represented by 55 image variants. Since data
preparation generally takes a long time, this work covers only the font digits, namely the numbers 0 to 9.
This brings the total number of images used in the experiment to 1100. The images were resized to
227 x 227 pixels.

Figure 1 shows an illustration of the CNN architecture developed in this research. It starts by resizing
the images to 227x227, followed by a stage composed of convolution, max-pooling and ReLU layers and a
classification layer, after which the digit is classified. This project experiments with different combinations
of convolution, pooling and ReLU layers to examine the effect of the number of layers on the digit
recognition accuracy. The multiple font digit recognition models have a maximum of 3 layers, including five
convolutional layers, three max-pooling or down-sampling layers and one classification layer. In the CNN,
the convolution step repeatedly convolves the input data with multiple different filters to extract features,
and the subsampling layer then summarizes the detected features into a feature map [22].

Figure 1. An illustration of CNN architecture


2.2. Bag of Features (BoF) 

The Bag of Features (BoF) is one of the machine learning techniques often used in computer vision.
It is also known as Bag of Visual Words [23]. This method builds on unordered collections of independent
visual features obtained through feature extraction from the image. An image can be transformed into a
vector of features and is then represented as a histogram of codes.

Figure 2 shows the general framework of BoF, which involves two phases [23]. The first phase
extracts SURF features. SURF contains an interest point detector, which locates the significant points in the
image, and a descriptor, which describes the features of those significant points and constructs the features
of the interest points [24]. The second phase is encoding and pooling. Encoding transforms the local features
into codes according to a predefined codebook learned from the training samples, while pooling aggregates
the encoded features into a global representation.

Figure 2. BoF framework [23]
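
The encoding and pooling phase can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function name `encode_and_pool`, the toy 2-D descriptors and the three-entry codebook are hypothetical, chosen only to keep the example small (real SURF descriptors have many more dimensions). Each descriptor is assigned to its nearest codeword (encoding), and the assignments are counted into a normalized histogram (pooling).

```python
from math import dist

def encode_and_pool(descriptors, codebook):
    """Encode each local descriptor as its nearest codeword, then pool into a histogram."""
    histogram = [0] * len(codebook)
    for descriptor in descriptors:
        # Encoding: index of the closest codeword (Euclidean distance).
        nearest = min(range(len(codebook)), key=lambda k: dist(descriptor, codebook[k]))
        histogram[nearest] += 1
    # Pooling: normalize the counts so images with different numbers of
    # key-points yield comparable global vectors.
    total = sum(histogram)
    return [count / total for count in histogram] if total else [0.0] * len(codebook)

# Hypothetical toy data: a codebook of 3 visual words and 4 local descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
descriptors = [(0.1, -0.2), (0.9, 1.2), (1.1, 0.8), (3.7, 4.1)]
bof_vector = encode_and_pool(descriptors, codebook)
print(bof_vector)
```

The resulting vector has one entry per codeword regardless of how many key-points the image produced, which is what allows it to be passed to a classifier.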


3. RESULTS AND ANALYSIS 
MATLAB is used for the multi-font digit recognition experiment, and the Chars74K dataset [21] is
used for training and testing.


3.1. Dataset 

Chars74K is widely used and known as a character recognition benchmark [21]. The dataset
contains symbols used in both English and Kannada. For English, the Latin script (excluding accents) and
Hindu-Arabic numerals are used. The images are normalized and centered by center of mass in 28x28
fields, and they contain grey levels as a result of the anti-aliasing technique. This dataset contains 10 classes
of digits, which are 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. It consists of 770 training images and 330 test images,
where each class has exactly 55 images. The images of each class are shown in different views to add
robustness to the technique applied. The images of each class are 128x128 pixels in size, and all images
were resized to 227x227 for this experiment to fit the MATLAB programs for CNN and BoF. Figure 3
shows some sample images from the Chars74K dataset.

Figure 3. Sample images from Chars74K dataset [21]
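
The 770 training and 330 test images correspond to a 70/30 division of the 1100 images. The text does not state how images were assigned to each part, so the following Python sketch only shows one plausible way to produce such a division: a seeded random shuffle of image indices. The function name `split_indices` and the seed are assumptions made for illustration.

```python
import random

def split_indices(n_images, train_fraction, seed=0):
    """Shuffle image indices reproducibly and cut them into train and test parts."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)  # fixed seed so the split can be repeated
    n_train = round(n_images * train_fraction)
    return indices[:n_train], indices[n_train:]

# 1100 images with a 0.7 training fraction (770 / 1100 from the figures in the text).
train_idx, test_idx = split_indices(1100, 0.7)
print(len(train_idx), len(test_idx))
```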


3.2. Convolutional Neural Network (CNN) 

As depicted in Figure 1, our CNN is composed of two sets of convolution, ReLU and pooling
layers. The CNN takes a gray scale image (one channel) as input, similar to the work in [24]. The size of the
image in the input layer is 227x227x1 pixels. Each convolutional layer convolves the output of its previous
layer with a set of learned kernels, followed by the ReLU and max-pooling layers. This makes convolutional
networks computationally efficient and allows them to scale to large images, since the transformation can be
implemented as a discrete convolution rather than a fully general matrix multiplication [25].
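
To make the convolution, ReLU and max-pooling sequence concrete, the following pure-Python sketch runs one such stage on a tiny 5x5 image. It is illustrative only: the kernel here is hand-written rather than learned, and the stride-1, no-padding convolution and the 2x2 non-overlapping pooling window are assumptions, since the text does not specify these settings.

```python
def conv2d_valid(image, kernel):
    """Stride-1 convolution without padding (cross-correlation, as is usual in CNN layers)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(feature_map):
    """Replace negative responses with zero."""
    return [[max(0, value) for value in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    return [[max(feature_map[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]

# Toy image containing a vertical bar, and a hand-written vertical-edge kernel.
image = [[0, 0, 1, 1, 0] for _ in range(5)]
kernel = [[-1, 1],
          [-1, 1]]

convolved = conv2d_valid(image, kernel)   # 4x4 map of edge responses
activated = relu(convolved)               # keeps only the left (rising) edge
pooled = max_pool(activated)              # 2x2 summary of the strongest responses
print(pooled)
```

The same small kernel is reused at every image position, which is why the operation needs far fewer parameters than a full matrix multiplication over all pixels.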

Table 1 lists the accuracy performance of CNN for multi-font digit recognition. Different
combinations of kernel size values for both convolution layers are tried to determine the best accuracy. As
depicted in Table 1, the Chars74K dataset is tested ten times to obtain the best result based on different
convolution layer settings and learning rates. For gray scale images, when the first and second layers use the
same value for the convolution layer, (5, 27) and (5, 27), the accuracy is 0.9633 (96.33%) with a total
training and testing time of 5 minutes and 19 seconds. Figure 4 shows the training and validation graph of
the CNN for this experiment. This is the best result obtained for digit recognition.

On the other hand, the accuracy decreases drastically when the CNN is tested with one and with
three sets of convolution, ReLU and pooling layers. The results in Table 1 indicate that different numbers of
CNN layers lead to different results. Furthermore, a higher number of layers does not necessarily produce
better accuracy. With one set of convolution, ReLU and pooling layers, only low-level features such as
edges are extracted, which is insufficient for multi-font digit recognition. With three sets of layers, accuracy
is low because the data has been compressed so much after three down-sampling passes by the max-pooling
layers that it no longer really represents the digits.
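
The compression effect can be estimated with the standard output-size formula floor((n + 2p - k) / s) + 1 for an input of width n, kernel size k, padding p and stride s. The Python sketch below traces the feature-map width through successive convolution and pooling stages. The 5x5 kernel, stride 1 and zero padding for convolution, and the 2x2 pooling window with stride 2, are assumptions made for illustration; the exact layer settings are not given in the text.

```python
def output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1

def trace_sizes(input_size, num_stages, conv_kernel=5, pool_size=2):
    """Feature-map width after each convolution + max-pooling stage."""
    sizes = [input_size]
    size = input_size
    for _ in range(num_stages):
        size = output_size(size, conv_kernel)                 # convolution, stride 1
        size = output_size(size, pool_size, stride=pool_size)  # max pooling
        sizes.append(size)
    return sizes

# Width of a 227x227 input after one, two and three stages.
sizes = trace_sizes(227, 3)
print(sizes)
```

Under these assumed settings, each stage roughly halves the width, so three stages reduce the area of the feature map to about one hundredth of the input.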


Table 1. Accuracy Performance of CNN for Multi-Font Digit Recognition

Figure 4. Visual graph for CNN with two sets of convolution, ReLU and max-pooling layers for multi-font
digit recognition


3.3. Bag of Features (BoF)

The BoF setup used in this experiment employs the Speeded-Up Robust Features (SURF)
technique as a feature extractor and image encoder. For clustering, the K-means algorithm is used. The
strongest 80 percent of the features from each category are kept. After K-means clustering, a vector
quantization technique maps the key-points from every training image into a fixed-dimensional histogram
vector (Bag-of-Features). This histogram acts as the input vector for the SVM classifier to build the training
set. In the testing phase, the key-points are extracted and fed into the cluster model to map them into a BoF
vector, which is finally fed into the trained SVM classifier to recognize the test image. Based on the
experiment, the average accuracy is 0.94. Figure 5 shows the visual word graph for BoF of the testing data
for this experiment.

Figure 5. Visual Word Graph for BoF
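
The codebook construction step can be sketched in a few lines of Python. This is a simplified illustration, not the MATLAB code used in the experiment: the function names `keep_strongest` and `kmeans`, the `(strength, descriptor)` tuples, the toy 2-D descriptors, and the deterministic choice of the first k points as initial centers are all assumptions. The sketch keeps the strongest 80 percent of features and then runs Lloyd's K-means iterations to obtain the visual words.

```python
from math import dist

def keep_strongest(features, fraction=0.8):
    """Keep the given fraction of (strength, descriptor) pairs with the highest strength."""
    ranked = sorted(features, key=lambda f: f[0], reverse=True)
    return ranked[:round(len(ranked) * fraction)]

def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; initial centers are simply the first k points."""
    centers = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centers[c]
            for c, cluster in enumerate(clusters)
        ]
    return centers

# Hypothetical features: (detector strength, 2-D descriptor).
features = [(0.9, (0.0, 0.0)), (0.8, (10.0, 10.0)), (0.1, (5.0, 5.0)),
            (0.7, (0.0, 1.0)), (0.6, (10.0, 11.0))]
strong = keep_strongest(features)            # drops the weakest of the five
points = [descriptor for _, descriptor in strong]
codebook = kmeans(points, k=2)
print(codebook)
```

Each resulting center acts as one visual word; an image is then described by how many of its key-points fall closest to each word, and that histogram is the vector handed to the SVM classifier.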


4. CONCLUSION 

In this research, different numbers of CNN layers have been investigated to determine the optimum
number of layers for multi-font digit recognition. A higher number of layers does not guarantee higher
accuracy. The performance of the CNN depends on the data itself. Besides the number of convolution,
ReLU and pooling layers, it also depends on the filter size, the number of epochs and the learning rate. A
smaller learning rate slows down the training process but may reach better accuracy. Experimental results
show that CNN performs slightly better than BoF. Because BoF produces only slightly lower accuracy than
CNN, it can still be considered a strong method for multi-font digit recognition. In future work, more
evaluations will be performed on CNN parameters for other datasets and different types of CNN
architecture. In addition, other features and classifiers will be investigated for BoF to compare the accuracy
performance of machine learning and deep learning.